Draft: Add batch processing for Postgres sync optimization #442

phernandez wants to merge 1 commit into main
Conversation
Implements streaming batch processing to reduce database roundtrips from 50K-80K to ~4K-6K for large projects (10K files).

**Phase 1: Scan Optimization**
- Add entity_repository.get_by_file_paths_batch() for bulk entity fetching
- Reduces scan phase from N queries to 1 batched query
- Impact: 427 files scanned with 2 queries vs 427 before

**Phase 2: Batch Infrastructure**
- Add sync_batch_size config (default: 100 files per batch)
- Add chunks() utility for streaming batch processing
- Add entity_repository.upsert_entities() for bulk inserts/updates
- Add observation_repository.delete_by_entity_ids() for batch deletes
- Add relation_repository.delete_outgoing_relations_from_entities() for batch deletes

**Phase 3: Sync Phase Optimization**
- Add sync_markdown_batch() method with 3-phase processing:
  1. Parse all files in batch (no DB operations)
  2. Bulk upsert entities in single transaction
  3. Post-process relations, checksums, search indexing per file
- Update new/modified file loops to use batch processing
- Add exception handling for circuit breaker and fatal errors
- Separate markdown/regular file processing in batches

**Test Updates**
- Update circuit breaker tests to work with batch architecture
- Change mocks from sync_markdown_file to sync_markdown_batch
- Update fatal error test to mock upsert_entities
- All circuit breaker tests passing (8/8)

**Expected Performance**
- Initial bulk import: ~10-15 queries/file (vs 43 before)
- Incremental sync: massive scan improvement + batch upsert benefits
- Handles both new files and existing files efficiently

Addresses N+1 query patterns and transaction overhead with remote Postgres databases while maintaining circuit breaker functionality and proper error handling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
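The bulk upsert in Phase 2 maps naturally onto Postgres's `INSERT ... ON CONFLICT`. Here is a minimal sketch of what `upsert_entities()` could look like using SQLAlchemy's PostgreSQL dialect; the `Entity` stand-in model, its columns, and the conflict target on `file_path` are assumptions for illustration, not the repository's actual schema:

```python
from sqlalchemy import String
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Entity(Base):  # stand-in model; the real Entity has more columns
    __tablename__ = "entity"
    id: Mapped[int] = mapped_column(primary_key=True)
    file_path: Mapped[str] = mapped_column(String, unique=True)
    title: Mapped[str] = mapped_column(String)
    checksum: Mapped[str] = mapped_column(String)


async def upsert_entities(session: AsyncSession, rows: list[dict]) -> None:
    """Insert or update a whole batch of entities in one statement,
    replacing N separate insert-or-update roundtrips."""
    if not rows:
        return
    insert_stmt = insert(Entity).values(rows)
    stmt = insert_stmt.on_conflict_do_update(
        index_elements=["file_path"],
        set_={
            "title": insert_stmt.excluded.title,
            "checksum": insert_stmt.excluded.checksum,
        },
    )
    await session.execute(stmt)
```

One statement per batch means one roundtrip and one transaction regardless of batch size, which is exactly where remote Postgres latency was being paid per file before.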
Claude finished @phernandez's task

**PR Review Complete ✅**

I've completed a thorough review of this batch processing optimization PR against the Basic Memory team checklist. Here's my detailed analysis:

**Code Quality & Standards ✅**
- Positives
- Architecture Compliance
- Code Quality Notes

**Testing & Documentation ✅**
- Test Coverage
- Test Quality
- Documentation
- Missing Documentation

**Basic Memory Architecture ✅**
- Atomic, Composable Design
- Database Consistency
- Local-First Preservation
- Knowledge Graph Integrity

**Security & Performance ✅**
- Security
- Performance
- Performance Considerations

**Compatibility ✅**
- Windows Compatibility
- Cross-Platform

**Issues & Recommendations**
- Minor Issues

**Code Quality Observations**
- Excellent Patterns
- Potential Optimization

**Summary**

This is high-quality production code.

**Checklist Results**
- Code Quality & Standards ✅
- Testing & Documentation
Summary
Implements streaming batch processing to dramatically reduce database roundtrips for Postgres sync operations. Reduces queries from 50,000-80,000 down to 4,000-6,000 for large projects (10K files).
Problem
Remote Postgres databases (like NeonDB) showed severe performance degradation:

- N+1 query patterns: one or more queries per file during both the scan and sync phases
- Per-file transaction overhead, with each roundtrip paying full network latency
- For a 10K-file project, this added up to 50,000-80,000 roundtrips per sync
Solution
Three-phase optimization approach:
Phase 1: Scan Optimization

- Add `entity_repository.get_by_file_paths_batch()` for bulk entity fetching, reducing the scan phase from N queries to a single batched query (427 files scanned with 2 queries instead of 427); a sketch follows below
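A minimal sketch of what a batched fetch like `get_by_file_paths_batch()` can look like, assuming a SQLAlchemy 2.0 async session and a simplified `Entity` stand-in model (the real schema differs):

```python
from sqlalchemy import String, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Entity(Base):  # simplified stand-in for the real model
    __tablename__ = "entity"
    id: Mapped[int] = mapped_column(primary_key=True)
    file_path: Mapped[str] = mapped_column(String, unique=True)


async def get_by_file_paths_batch(
    session: AsyncSession, file_paths: list[str]
) -> dict[str, Entity]:
    """One SELECT ... WHERE file_path IN (...) for the whole scan,
    instead of one SELECT per file (the N+1 pattern)."""
    if not file_paths:
        return {}
    result = await session.execute(
        select(Entity).where(Entity.file_path.in_(file_paths))
    )
    return {e.file_path: e for e in result.scalars()}
```

Returning a dict keyed by path lets the scan loop check each file against the already-fetched map instead of touching the database again.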
Phase 2: Batch Infrastructure
- `sync_batch_size` config (default: 100 files per batch)
- `chunks()` utility for streaming batch processing (sketched below)
- `upsert_entities()`, `delete_by_entity_ids()`, `delete_outgoing_relations_from_entities()` repository methods for bulk writes and deletes
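The `chunks()` utility is most likely a small streaming generator; here is a sketch under that assumption:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunks(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items, so a 10K-file
    project streams through in batches rather than being materialized
    all at once."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


# e.g. with the default sync_batch_size of 100:
for batch in chunks([f"file_{i}.md" for i in range(250)], 100):
    print(len(batch))  # 100, 100, 50
```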
Phase 3: Sync Phase Optimization

- `sync_markdown_batch()` method: parse all files in the batch (no DB operations), bulk-upsert entities in a single transaction, then post-process relations, checksums, and search indexing per file (see the sketch below)
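The three-phase shape of `sync_markdown_batch()` might look roughly like this; every helper here is a labeled stand-in for the real parsing, repository, and indexing code, not the PR's actual implementation:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ParsedFile:
    path: str
    content: str


async def parse_file(path: str) -> ParsedFile:
    # Stand-in for markdown parsing: pure CPU/IO, no DB access.
    return ParsedFile(path=path, content=f"contents of {path}")


async def bulk_upsert(parsed: list[ParsedFile]) -> list[int]:
    # Stand-in for entity_repository.upsert_entities(): one statement,
    # one transaction, returning entity ids for the whole batch.
    return list(range(len(parsed)))


async def post_process(entity_id: int, parsed: ParsedFile) -> None:
    # Stand-in for per-file relation, checksum, and search-index updates.
    pass


async def sync_markdown_batch(paths: list[str]) -> None:
    # Phase 1: parse every file in the batch, no DB operations.
    parsed = [await parse_file(p) for p in paths]
    # Phase 2: bulk upsert all entities in a single transaction.
    entity_ids = await bulk_upsert(parsed)
    # Phase 3: relations, checksums, search indexing per file.
    for entity_id, pf in zip(entity_ids, parsed):
        await post_process(entity_id, pf)


asyncio.run(sync_markdown_batch(["notes/a.md", "notes/b.md"]))
```

Keeping parsing DB-free means a parse failure can skip a file before the transaction starts, which is how the batch flow stays compatible with the circuit breaker's error handling.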
Performance Impact

Expected improvement:

- Initial bulk import: ~10-15 queries/file (vs 43 before)
- Total roundtrips for a 10K-file project: ~4,000-6,000 (vs 50,000-80,000)
- Incremental sync: large scan-phase improvement plus batch upsert benefits
Test Coverage

- Circuit breaker tests updated for the batch architecture: mocks changed from `sync_markdown_file` to `sync_markdown_batch`, and the fatal error test now mocks `upsert_entities`
- All 8 circuit breaker tests passing
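In test terms, the mock change might look something like the following pytest sketch (requires pytest-asyncio); the `SyncService` stub and the test body are purely illustrative, not the project's actual fixtures:

```python
from unittest.mock import AsyncMock, patch

import pytest


class SyncService:  # stub; only the method name matters for the mock
    async def sync_markdown_batch(self, batch: list[str]) -> None: ...


@pytest.mark.asyncio
async def test_batch_failure_surfaces():
    service = SyncService()
    # Patch the batch entry point (previously sync_markdown_file was
    # patched) so a simulated failure exercises the error path.
    with patch.object(
        service,
        "sync_markdown_batch",
        new=AsyncMock(side_effect=RuntimeError("simulated failure")),
    ):
        with pytest.raises(RuntimeError):
            await service.sync_markdown_batch(["notes/a.md"])
```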
Test Plan
🤖 Generated with Claude Code